Abstract 

In this article we present a model of human written text based on a statistical 
mechanics approach, deriving the potential energy for different parts of the 
text from a large text corpus. 

We have checked the results numerically and found that the "specific heat" 
parameter effectively separates the closed class words from the specific terms 
used in the text. 

1 Introduction 

Let us imagine that we are looking for some article on the Web. Probably the first 
thing we will do is to go to Google and type some keywords. If we type a query 
like "I am looking for an article about statistical mechanics of images", although it is 
exactly what we want, we will probably get nothing related to the subject, or only 
content partially related to it. So it would be better to refine the query to 
something like "image "statistical mechanics"" in order to get some reasonable results. 
To extract useful words we do not use the structure of the language; actually we 
ignore it. Also, we hardly use the common words of the sentence in the query. What 
we use are some statistical estimations of the parts of the query and words that stick 
well with the meaning of the query. This leads us to the idea of regarding the text as a 
statistical mixture of its parts that sticks well with its meaning. Of course, the text 
must stick well with the language in which it is written as well. Therefore, we can 
consider the text as conditioned on the language in which it is written. We can even 
consider that the text must stick well to the area to which it belongs, as for example 
"nonlinear physics" or "novels of the 17th century". 

Google is a product of some 15 years of evolution of Web page search. What this 
evolution shows is that if we are looking for the meaning of a text, we must look for 
specific, statistically salient keywords that are supposed to be present in it, largely 
ignoring the syntactic and semantic structure of the language. This gives us the 
inspiration to build a statistical mechanics model of a human written text, considering 
it as composed of its "particles", the words. 

The best way to analyze a text written in some language is to have an exact 
description of the language, for example a weighted context-free grammar. 
However, it is not clear whether such a description exists, because people usually do 
not speak grammatically correctly. Some trivial grammars always exist, for example 
grammars that allow all possible strings over the alphabet. Because these grammars do 
not constrain the expressions, the information they carry about the statistical 
properties of the language is very poor. Moreover, bearing in mind Zipf's law 
of the frequency distribution of the words [1], even if a reasonable grammar exists, 
a single text of arbitrary length will contain some 40% hapax legomena (words that 
occur only once). As a consequence, the length of the grammar will be of the order of 
the length of the text for any text we choose. Therefore, it is easier to consider the 
language as the set of all the texts spoken/written in that language. Using statistical 
arguments, we do not need all texts, but only a sufficiently large random set of texts 
in order to treat the problem. 

The model we investigate consists of a text T and a vocabulary V, written in one 
and the same language. The vocabulary is formed using all the words of some huge 
collection of texts, written in that language. 

A text that treats some well-defined subject is highly restricted by this subject. 
The language, as a whole, has no such restriction. Therefore, the relative excess (or 
higher frequency) of a word in the vocabulary is a normal situation. 

On the contrary, the relative excess of a word in the text has a specific meaning. If 
a word occurs with a much higher frequency in the text than in the common 
language, this can be interpreted as an indication that the text treats exactly the 
subject expressed by this word, i.e. that the word is a specific term or keyword in the 
text. This is the first class of words in the text that we will consider in this paper. 

On the other hand, the text will always contain words that are common in the 
language, which have more or less the same frequency in any text and in the vocabulary. 
A large fraction of the words of that type will be formed by the so-called function 
words. These words by themselves carry no meaning, but are essential for expressing 
the language structure. A typical example of a function word in English is the word 
"the". A similar and more strictly defined category is the class of closed-class words 
that, by definition, are the words which do not change their form in any text. 

Finally, the third class of words that will follow more or less the same frequency 
distribution in the text and in the vocabulary are the common words. They serve to 
transmit the meaning of the text, but are common for every text that must explain 
some concept, for example, like the word "explain" in this sentence. 

The paper is organized in the following way. In Section 2 we define the model and 
the approximations used. In Section 3 we calculate numerically the thermodynamic 
quantities for a set of arbitrarily selected texts. We show that the specific terms 
(keywords) and the rest of the text have different thermodynamic behavior. Finally, in 
Section 4 we present our conclusions. 



2 The Model 

The relationship between some text and its language can be considered as a 
conditional distribution. In this article, however, we consider the following approach. 

We consider the vocabulary as a solid-state base, composed of "molecules", 
which are actually the parts of the text (the words of the language). The text itself 
is considered as a liquid solution of "molecules", derived in the same manner as the 
vocabulary. The text and the vocabulary "react" and there exists some energy gain 
when the reaction takes place, so some "molecules" settle on the solid base. 

The excess of the "molecules" (words) of a given type in the vocabulary, i.e. in 
the "solid" compound, has no significant meaning. Therefore, we can concentrate only 
on the "liquid" phase, considering this phase as the significant one. Equivalently, we 
can consider only the deposited part of the "molecules", those that have entered into 
reaction with the vocabulary. As a first approximation, the molecules can be assumed 
to react only if they represent one and the same word. 

More rigorously, our model consists of a vocabulary of length $L_v$, a text of length 
$L_t$, the "molecules" (words) $w$ of the text that are matched with the "molecules" of 
the vocabulary, and the corresponding numbers of occurrences of these "molecules": 
$n_t(w)$ for the text and $n_v(w)$ for the vocabulary. In order to fulfill the requirement 
of equal molar mass we can introduce some standard text length $L_0$ and normalize the 
number of occurrences of $w$ according to this length: 

$$N_t(w) = n_t(w)\,\frac{L_0}{L_t}, \qquad N_v(w) = n_v(w)\,\frac{L_0}{L_v}.$$ 

For convenience we choose $L_0 = L_t$ in the numerical experiments. We denote by $m(w)$ 
the number of deposited molecules, normalized to a length $L_0$. This parameter will 
be used below as an order parameter for the system. 

The problem of regarding the text as a thermodynamic system consists of defining 
the "molecules" $w$ and the energy of interaction $E(w) = E(m(w), N_t(w), N_v(w), L_0)$ 
between them. 
In this article we will regard the "molecules" as usual English words, consisting of 
continuous strings of letters, separated by non-letter symbols. In the rest of the article, 
we will not distinguish between "molecules" and words. We will assume that the words 
are independent, i.e. that there is no interaction between different words. Due 
to this independence, the extensive thermodynamic quantities, for example the 
free energy, are given by the sum of the individual quantities corresponding to 
the different words. Therefore we can build a theory based on a single word and 
extend it to the whole text. 
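
As a minimal sketch of the bookkeeping described so far, the Python fragment below splits a string into words (maximal runs of letters), counts them, and rescales the counts to a standard length. The function names `tokenize` and `normalized_counts`, the lower-casing of words, the restriction to ASCII letters and the two toy strings are assumptions made for illustration; the text does not specify them.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into words: maximal runs of letters, lower-cased (assumption)."""
    return re.findall(r"[A-Za-z]+", text.lower())

def normalized_counts(words, L0):
    """Return N(w) = n(w) * L0 / L for every word w, where L = len(words)."""
    L = len(words)
    counts = Counter(words)
    return {w: n * L0 / L for w, n in counts.items()}

# Toy "text" and toy "vocabulary" corpus (hypothetical data).
text_words = tokenize("The topology of the space; the space is compact.")
vocab_words = tokenize("The cat and the dog. A space of time, and the rest of it.")

L0 = len(text_words)                       # standard length chosen equal to the text length
N_t = normalized_counts(text_words, L0)    # equals the raw counts because L0 = L_t
N_v = normalized_counts(vocab_words, L0)   # vocabulary counts rescaled to length L0
```

With this choice of the standard length the text counts stay integers, which is convenient when they are later used in binomial coefficients.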

Further, we will consider that the language (the solid compound) imposes some 
potential energy field depending on the parameters $N_v$, $L_v$, but not on the text, i.e. 
not on $N_t$. We also assume that the system is in thermal equilibrium. 

According to general thermodynamic principles, the state of the system can 
be described by its energy $E$ alone. The probability $P(m)$ of the state with $m$ 
deposited molecules (words) is: 

$$P(m) \propto G(m)\,\exp\bigl(-\beta E(m, N_t, N_v, L_0)\bigr), \qquad (1)$$ 

where $E(m, N_t, N_v, L_0)$ is the energy of settling $m$ molecules, $G(m)$ is the 
degeneracy of these states and $\beta$ is the inverse temperature, $\beta = 1/T$. 

The degeneracy is just the number of ways one can select $m$ molecules 
out of a set of $N_t$ molecules, i.e. $\binom{N_t}{m}$. Note that this number is strictly 
zero if $m > N_t$, because we have only $N_t$ molecules. 

Regarding that system, one can impose the requirement that its thermodynamic 
properties scale with the length of the texts, i.e. if we scale simultaneously the size 
of the vocabulary and the size of the text by $s$, then the thermodynamic potentials 
scale as: 

$$E(sm, sN_t, sN_v, sL_0) = s\,E(m, N_t, N_v, L_0)$$ 

and 

$$\log G(sm) = s \log G(m).$$ 

Considering the frequency of occurrence of a single word $w$ in a text with length $L$, 
in order to speak about something measurable, we must regard the case where the 
word occurs in the text $x \gg 1$ times. 

Let us assume that we have a language derived from some weighted context-free 
grammar whose terminal symbols are the words of the language. Being interested only 
in the frequency of the words, we can transform any language definition 
$A : \alpha w \beta$ into $A : w \alpha \beta$. Further, all words different from $w$ can 
be regarded as one and the same word, say $v$, thus obtaining a grammar with only two 
terminal symbols. We can further join into one symbol all non-terminal symbols that 
cannot produce $w$. 

The simplest grammar of this type is: 

$$\begin{array}{lll} S : & wR & [1] \\ R : & vR & [p] \\ R : & \Lambda & [1-p] \end{array} \qquad (2)$$ 

where $S$ is the axiom, $R$ is an auxiliary non-terminal symbol and $\Lambda$ is the empty 
string. Within the square brackets we give the probabilities of the corresponding 
definitions, with $0 < p < 1$. This grammar generates sentences containing $w$ whose 
lengths have an exponentially falling frequency distribution. 
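
A small simulation can illustrate the claim about sentence lengths. The sketch below assumes that the rule R : vR is chosen with some probability p and the rule R : Λ with probability 1 - p; the value p = 0.8, the function name `sample_sentence_length` and the random seed are illustrative assumptions, not values taken from the paper.

```python
import random

def sample_sentence_length(p, rng):
    """Generate one sentence from S : wR, R : vR [p], R : Lambda [1-p]; return its length."""
    length = 1                      # S always emits the word w first
    while rng.random() < p:         # rule R : vR, chosen with probability p
        length += 1
    return length                   # rule R : Lambda ends the sentence

rng = random.Random(0)              # fixed seed so the run is reproducible
p = 0.8
lengths = [sample_sentence_length(p, rng) for _ in range(100_000)]

# Empirical frequency of each length k; for this grammar the exact law is
# (1 - p) * p**(k - 1), i.e. a geometric (discrete exponential) decay.
freq = {k: lengths.count(k) / len(lengths) for k in range(1, 6)}
mean_length = sum(lengths) / len(lengths)   # exact mean is 1 / (1 - p)
```

The empirical frequencies decay by roughly a factor p from one length to the next, which is the discrete counterpart of the exponential distribution used in the argument.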

Fixing the length of the text, we can obtain the probability distribution of $w$ as a 
sum of a finite number of exponentially distributed variables. By definition [2], this is 
the gamma distribution: 

$$P(x; w) = \frac{e^{-bx}\, x^{a-1}\, b^{a}}{\Gamma(a)}. \qquad (3)$$ 

Here the parameter $a$ is proportional to the length of the text $L$, while the parameter 
$b$ does not depend on it, but rather on the word and the class of text we are regarding. 
Although the above consideration is not strict, it can give an idea about the type of 
the word frequency distribution for a text with a fixed length. Using a general 
context-free grammar, one can find the expressions for the probability and the length of 
the sentence, following the techniques developed in [3]. 

We have checked the gamma-distribution hypothesis on a set of about 
19000 English texts from the Gutenberg collection and have found excellent 
agreement with the experimental data for all the words with frequency of occurrence 
$p(w) > 5/10000$. 

Next we have studied the asymptotic behavior of the distribution. For this aim, we 
have replicated the text $s$ times and have considered the limit 
$\lim_{s\to\infty} [\log P(sx; w; sa, b)]/s$. One can easily find that this limit is 
$a - bx - a\log a + a\log x + a\log b$. Using that the mean of $x$ is $\bar{x} = a/b$ and 
changing the sign, we finally obtain the following expression for the asymptotic 
behavior of $P(x)$: 

$$E_p(x; w) = -\log P(x) = \bar{x}\, b \left[ \frac{x}{\bar{x}} - 1 - \log\frac{x}{\bar{x}} \right]. \qquad (4)$$ 

$E_p$ can be regarded as a potential energy of the word $w$ in the language. The linear 
term accounts for the excess of words of a given type in the text, while the 
logarithmic one corresponds to the entropic part of the energy. A typical energy curve 
is given in Fig. 1. 
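
The limit behind Eq. (4) can be checked numerically. The sketch below assumes the gamma density in the shape/rate form $e^{-bx} x^{a-1} b^a / \Gamma(a)$, whose mean is $a/b$, and takes the energy as the negative logarithm, so that it has its minimum at the mean as in Fig. 1. The function names and the values of a, b and x are illustrative assumptions.

```python
import math

def log_gamma_pdf(x, a, b):
    """log of P(x) = exp(-b x) x**(a-1) b**a / Gamma(a) (shape a, rate b)."""
    return -b * x + (a - 1.0) * math.log(x) + a * math.log(b) - math.lgamma(a)

def potential_energy(x, x_mean, b):
    """E_p = b * x_mean * (x/x_mean - 1 - log(x/x_mean)); it vanishes at x = x_mean."""
    r = x / x_mean
    return b * x_mean * (r - 1.0 - math.log(r))

a, b = 2.0, 0.25          # illustrative values; the mean is a / b = 8
x = 3.0
rows = []
for s in (1, 10, 100, 10_000):
    # -log P of the text replicated s times, divided by s
    per_copy = -log_gamma_pdf(s * x, s * a, b) / s
    rows.append((s, per_copy, potential_energy(x, a / b, b)))

for s, per_copy, limit in rows:
    print(s, per_copy, limit)
```

The second column approaches the third as s grows, with a remainder that decays roughly like log(s)/s.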

Using the above considerations, the corresponding partition function for a given 
word $w$ is: 

$$Z(w, \beta) = \sum_{m=1}^{N_t} G(m, N_t)\, \exp\bigl(-\beta E_p(m, N_v)\bigr), \qquad (5)$$ 

where $G(m, N_t) = \binom{N_t}{m}$ and we have omitted the argument $w$ on the right-hand 
side of the equation. Thus, the expression for the partition function is: 

$$Z(w, \beta) = \sum_{m=1}^{N_t} \exp\bigl(-\beta E_{tot}(m, N_t)\bigr), \qquad (6)$$ 

where 

$$E_{tot}(m, N_t) = -\frac{1}{\beta} \log\binom{N_t}{m} + N_v b \left[ \frac{m}{N_v} - 1 - \log\frac{m}{N_v} \right] \qquad (7)$$ 

is the total energy corresponding to a given word $w$. It is composed of a potential 
part $E_p$ and a combinatorial part $-\frac{1}{\beta}\log G(m, N_t)$. 

Figure 1: Potential energy of a word as a function of its number of occurrences. It 
consists of two parts: a logarithmically falling part, for values of the argument from 
zero to the mean frequency of the word, and a linearly increasing part, predominant in 
the range where the frequency of the word is larger than its mean frequency. 
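
The partition sum for a single word can be evaluated directly. The following sketch assumes an integer $N_t$, a binomial degeneracy and the potential in the form $E_p(m) = b N_v (m/N_v - 1 - \log(m/N_v))$, which has its minimum at $m = N_v$; the function names and the numerical values in the last line are hypothetical.

```python
import math

def log_binomial(n, m):
    """log of the binomial coefficient C(n, m), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)

def potential_energy(m, N_v, b):
    """E_p(m) = b * N_v * (m/N_v - 1 - log(m/N_v)); minimum at m = N_v."""
    r = m / N_v
    return b * N_v * (r - 1.0 - math.log(r))

def log_partition(beta, N_t, N_v, b):
    """log Z = log sum_{m=1}^{N_t} C(N_t, m) * exp(-beta * E_p(m))."""
    terms = [log_binomial(N_t, m) - beta * potential_energy(m, N_v, b)
             for m in range(1, N_t + 1)]
    top = max(terms)                       # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def free_energy(beta, N_t, N_v, b):
    """Contribution of one word to the free energy: F = -(1/beta) * log Z."""
    return -log_partition(beta, N_t, N_v, b) / beta

# One word occurring 20 times in the text, with normalized vocabulary count 5.
F_word = free_energy(1.0, 20, 5.0, 0.25)
```

Working with logarithms of the terms keeps the sum stable when the binomial coefficients become large; the free energy of a whole text would be the sum of such single-word contributions.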

Finally, the full free energy of the text is given by the sum over the different words 
of the text: 

$$F(\beta) = -\frac{1}{\beta} \sum_{w} \log Z(w, \beta). \qquad (8)$$ 

The equation for the order parameter $m$ can be obtained by using the saddle-point 
method. Its application, combined with the Stirling approximation, 
$\log N! \approx N \log N - N$, $N \gg 1$, gives the following equation for the order 
parameter $m$: 

$$\frac{\partial E_{tot}}{\partial m} = \frac{1}{\beta} \log\frac{m}{N_t - m} - \frac{b N_v}{m} + b = 0. \qquad (9)$$ 

This equation can be solved in closed form and the solution is: 

$$m^{*} = N_v\, \frac{b\beta}{b\beta N_v/N_t + W\!\left( b\beta\, \frac{N_v}{N_t}\, e^{\,b\beta - b\beta N_v/N_t} \right)}, \qquad (10)$$ 

where $W(\cdot)$ is the Lambert W function [2]. 
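
The closed-form solution can be verified by substituting it back into the saddle-point condition. The sketch below assumes the condition in the form $(1/\beta)\log(m/(N_t - m)) - bN_v/m + b = 0$ and the solution $m^* = N_v b\beta / (b\beta N_v/N_t + W(b\beta (N_v/N_t) e^{b\beta - b\beta N_v/N_t}))$. The helper `lambert_w0` is a hypothetical pure-Python Newton solver for the principal branch, written here only to keep the example free of dependencies.

```python
import math

def lambert_w0(z, tol=1e-14):
    """Principal branch W(z) for z >= 0, by Newton iteration on w * exp(w) = z."""
    w = math.log1p(z)                  # starting guess at or above the root for z >= 0
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w

def order_parameter(beta, N_t, N_v, b):
    """Closed-form saddle point m* expressed through the Lambert W function."""
    k = beta * b
    r = N_v / N_t
    return N_v * k / (k * r + lambert_w0(k * r * math.exp(k - k * r)))

def saddle_residual(m, beta, N_t, N_v, b):
    """Left-hand side of the saddle-point equation; it should vanish at m = m*."""
    return math.log(m / (N_t - m)) / beta - b * N_v / m + b

m_star = order_parameter(2.0, 20.0, 5.0, 0.25)
residual = saddle_residual(m_star, 2.0, 20.0, 5.0, 0.25)
```

If SciPy is available, `scipy.special.lambertw` could replace the helper; note that it returns a complex number, so its real part would have to be taken. For very large values of $b\beta$ the exponential may overflow, which this sketch does not guard against.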
The entropy $S$ for a single word is: 

$$S = -\frac{\partial F}{\partial T} = N_t \log N_t - m \log m - (N_t - m) \log(N_t - m). \qquad (11)$$ 

Substituting Eq. (10) into Eq. (11), we obtain the behavior of the entropy as a 
function of the inverse temperature and the ratio $N_v/N_t$, shown in Fig. 2. 

Figure 2: The entropy for a single word. 

Figure 3: The "specific heat" $C_v$. 

Finally, the second derivative of the free energy, which is related to the "specific 
heat", $C_v = -T\,\partial^2 F/\partial T^2$, is represented in Fig. 3. We have used the 
notation adopted in thermodynamics for an isochoric process, $C_v$, although what is 
fixed in the present approach is the number of 
occurrences for a given word. A section of that figure for 6 = 1, iV^ = 5 is represented 
in the right panel of the same figure. 
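
For a single word the specific heat can also be obtained from the exact partition sum rather than from the saddle point. The sketch below uses the standard fluctuation identity $C_v = \beta^2 (\langle E_p^2 \rangle - \langle E_p \rangle^2)$, which is equivalent to $-T\,\partial^2 F/\partial T^2$ when $F = -T \log Z$ and the weights are $\binom{N_t}{m} e^{-\beta E_p(m)}$. This exact-sum route, the function name and the parameter values are assumptions made for illustration; the paper's figures may have been produced differently.

```python
import math

def specific_heat(beta, N_t, N_v, b):
    """C_v = beta**2 * Var(E_p) under the weights C(N_t, m) * exp(-beta * E_p(m))."""
    log_w, energies = [], []
    for m in range(1, N_t + 1):
        r = m / N_v
        e = b * N_v * (r - 1.0 - math.log(r))          # potential energy E_p(m)
        log_g = (math.lgamma(N_t + 1) - math.lgamma(m + 1)
                 - math.lgamma(N_t - m + 1))            # log of the degeneracy
        energies.append(e)
        log_w.append(log_g - beta * e)
    top = max(log_w)
    weights = [math.exp(lw - top) for lw in log_w]     # rescaled to avoid overflow
    z = sum(weights)
    mean_e = sum(w * e for w, e in zip(weights, energies)) / z
    mean_e2 = sum(w * e * e for w, e in zip(weights, energies)) / z
    return beta ** 2 * (mean_e2 - mean_e ** 2)

# Scan the temperature T = 1 / beta for one illustrative word.
curve = [(T, specific_heat(1.0 / T, 20, 5.0, 1.0)) for T in (0.1, 0.5, 1, 2, 5, 10)]
```

Plotting such a curve for words with different values of $N_t$, $N_v$ and $b$ is one way to compare the temperature ranges in which their specific heat peaks.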

3 Numerical experiments 

To check the above results experimentally on real texts, we used a collection of about 
19000 English texts from the Gutenberg project (GC) with size of 5.10'' words (GC). 
We also used a collection of 500 articles from the area of the non-linear physics (NL) 
given by the repository xxx.archiv.gov. In order to avoid problems with the different 
multiple versions of the articles, we used only the first version of each one. Finally, 
we used a list of 257 closed-class words of English instead of function words. 

For estimating the parameters $a$ and $b$ of the distribution of a single word, we used 
GC. We found that $b$ is within the range [0.01-20] with an average value of 0.25. The 
value of the parameter $a$, for a text with a length $L = 10000$, is within the range 
[0-2.6]. Note that the parameters $a$ and $b$ are well defined, and with sufficient 
confidence, only if $p(w)L \gg 1$. For practical purposes we can say that within a corpus 
of 10* words, these two parameters are well defined only for the 2400 most frequent 
words. For the rest of the words we used a simplifying assumption, because one cannot 
reliably prove or disprove the hypothesis with two degrees of freedom ($a$ and $b$) 
when fewer than four measurements are available for their estimation. The hypothesis we 
adopted was that the less frequent words all have the same value of the parameter 
$b$ of the distribution. Thus we can join all the words that are not frequent enough 
and estimate that parameter. The result is very close to the mean value of $b$. The 
parameter $a$, being proportional to the length of the text, is not so critical for the 
estimation (actually we need only and b). 
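
The paper does not state which estimator was used for $a$ and $b$. As one possible sketch, the method of moments for a gamma law in the shape/rate form gives mean $= a/b$ and variance $= a/b^2$, hence $b$ = mean/variance and $a$ = mean²/variance. The function name and the sample counts below are hypothetical.

```python
from statistics import mean, pvariance

def estimate_gamma_moments(counts):
    """Method-of-moments estimate of the shape a and rate b of a gamma law.

    For P(x) ~ exp(-b x) x**(a-1): mean = a / b and variance = a / b**2.
    """
    mu = mean(counts)
    var = pvariance(counts, mu)
    if var == 0:
        raise ValueError("need at least two different counts")
    b = mu / var
    a = mu * mu / var
    return a, b

# Hypothetical occurrence counts of one word in six texts of equal length.
counts = [4, 7, 5, 9, 6, 5]
a_hat, b_hat = estimate_gamma_moments(counts)
```

With only a handful of counts such estimates are very noisy, which is consistent with pooling the infrequent words to estimate a common value of $b$.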




Figure 4: The specific heat $C_v$ for different words belonging to one and the same text. 
The upper two curves in the left panel represent two different terms ("topology" and 
"topological"). The lower curves of the left panel represent two function words ("the" 
and "are"). In the right panel, the curve of the word "are" is magnified in order to 
show it together with the typical common word "important". 

Fig. 4 shows the typical behavior of the specific heat $C_v$ for different kinds of 
words: for terms (the two upper curves in the left panel), for function words (the 
two lower curves of the same panel) and for common words (the lower curve in the 
right panel). One can observe that $C_v(N_t, N_v, b)$ exhibits different behavior for the 
different word classes, with the corresponding maxima belonging to different temperature 
ranges. 

Because the function words have a higher frequency of occurrence, one might expect 
that they play a predominant role in the behavior of the specific heat. As we can 
observe, this is not true: the specific heat for the terms is much higher than the one 
corresponding to the function words. These results can be interpreted as an indication 
that the most vulnerable parts of the speech are carried by the common words, while 
the most resistant ones are carried by the domain-specific terms. 

4 Conclusion 

In the present article we propose a statistical mechanics approach for the analysis of 
human written text. By introducing an energy that describes the system, taking 
into consideration a realistic distribution of the words inside a large text corpus, we 
were able to derive the thermodynamic parameters of the system in a closed analytical 
form. 

By studying the behavior of the specific heat of the system, we have shown that 
this quantity is different for different kinds of words (terms, function words and 
common words). 

We have applied the above method to different corpora of texts and we have found 
one and the same universal behavior, which does not depend on the particular text. 

Our numerical results show that the "specific heat" effectively separates the closed 
class words from the specific terms and the common words used in the text. 